Introduction

The rapid global trend toward urbanization poses significant challenges for ensuring that rural populations are not left behind in development and policy considerations. Tracking vital statistics for these populations is increasingly difficult, leading to a shortage of reliable, up-to-date data. Private data sources, such as social media data, have shown potential to complement or supplement existing public data. This study explores using Facebook advertising data, which offers a digital “census” of over two billion users, to measure and analyze rural-urban inequalities.

This vignette uses rSocialWatcher (Marty, 2020) to query Facebook Marketing data, demonstrating the package's potential for social science research. We use Facebook data to analyze statistically significant differences in selected behaviors between urban and rural regions in Connecticut, United States. The vignette is based on “Facebook Ads as a Demographic Tool to Measure the Urban-Rural Divide” (Rama et al., 2020), which similarly uses Facebook Marketing data to study the urban-rural divide in Italy.

Setup

Below we load the relevant packages and define our Facebook API credentials.

## Load packages
library(rsocialwatcher)
library(dplyr)
library(tidyr)
library(ggplot2)
library(WDI)
library(janitor)
library(leaflet)
library(ggpubr)
library(knitr)
library(kableExtra)
library(sf)
library(spData)

TOKEN        <- "Token"
VERSION      <- "Version"
CREATION_ACT <- "Creation Act"
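In practice, credentials should not be hard-coded in a script. A minimal sketch reading them from environment variables (the variable names `FB_TOKEN`, `FB_VERSION`, and `FB_CREATION_ACT` are our own choice):

```r
# Read Facebook Marketing API credentials from environment variables,
# e.g. set in ~/.Renviron; Sys.getenv() returns "" if a variable is unset
TOKEN        <- Sys.getenv("FB_TOKEN")
VERSION      <- Sys.getenv("FB_VERSION")       # e.g. "v19.0"
CREATION_ACT <- Sys.getenv("FB_CREATION_ACT")  # Facebook ad account ID
```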

Part 1: iOS and Android vs Household Income

CT Income Data From U.S. Census

From the United States Census Bureau, we acquired American Community Survey 2020 household income data (subject table S1903). We then select the following columns:

  • NAME: Geographic Area Name
  • S1903_C03_001E: Estimate - Median income (dollars) - HOUSEHOLD INCOME BY RACE AND HISPANIC OR LATINO ORIGIN OF HOUSEHOLDER
  • S1903_C01_001E: Estimate - Number - HOUSEHOLD INCOME BY RACE AND HISPANIC OR LATINO ORIGIN OF HOUSEHOLDER
ct_income_df <- ct_income_data %>% 
    select(NAME, S1903_C01_001E, S1903_C03_001E)
ct_income_df <- ct_income_df[11:nrow(ct_income_df), 1:3]
ct_income_df$ZipCode <- as.integer(sub("ZCTA5 ", "", ct_income_df$NAME))
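The `ct_income_data` object above is assumed to be a raw download from data.census.gov. Alternatively, the same ACS table can be pulled programmatically; a sketch assuming the tidycensus package and a (free) Census API key — the exact call is an assumption, not the method used here:

```r
library(tidycensus)

# Requires a Census API key, e.g. census_api_key("KEY", install = TRUE).
# Pull the 2020 ACS 5-year S1903 variables at the ZCTA level in wide format,
# which yields NAME, S1903_C01_001E, and S1903_C03_001E columns.
ct_income_data <- get_acs(
  geography = "zcta",
  variables = c("S1903_C01_001", "S1903_C03_001"),
  year      = 2020,
  survey    = "acs5",
  output    = "wide"
)
```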

Querying ZIP Code Data from Facebook

As Connecticut zip codes all begin with “06”, we pull Facebook data for all zip codes matching this prefix. We then visualize the regions by household income with Leaflet.

CT_zip_df <- get_fb_parameter_ids(type = "zip", 
                               q = "06",
                               country_code = "US",
                               version = VERSION, 
                               token = TOKEN)
head(CT_zip_df) %>% kable
| key | name | type | country_code | country_name | region | region_id | primary_city | primary_city_id | supports_region | supports_city |
|---|---|---|---|---|---|---|---|---|---|---|
| US:06902 | 06902 | zip | US | United States | Connecticut | 3849 | Stamford | 2425322 | TRUE | TRUE |
| US:06023 | 06023 | zip | US | United States | Connecticut | 3849 | Berlin | 2424573 | TRUE | TRUE |
| US:06068 | 06068 | zip | US | United States | Connecticut | 3849 | Salisbury | 2425242 | TRUE | TRUE |
| US:06277 | 06277 | zip | US | United States | Connecticut | 3849 | Thompson | 2425358 | TRUE | TRUE |
| US:06786 | 06786 | zip | US | United States | Connecticut | 3849 | Plymouth | 2425170 | TRUE | TRUE |
| US:06441 | 06441 | zip | US | United States | Connecticut | 3849 | Haddam | 2424844 | TRUE | TRUE |
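The `zip_ct_sf` object used below (zip-code polygons with the Facebook `key` attached) is not constructed in this vignette. One way to build it is sketched here, assuming the tigris package; the column name `ZCTA5CE10` and the exact call are assumptions:

```r
library(tigris)  # downloads TIGER/Line shapefiles; requires internet access
library(dplyr)
library(sf)

# ZCTA (zip code tabulation area) polygons starting with "06" (Connecticut);
# the 2010 vintage supports filtering by state
zip_ct_sf <- zctas(starts_with = "06", state = "CT", year = 2010) %>%
  st_as_sf() %>%
  transmute(name = ZCTA5CE10) %>%             # zip code as character
  inner_join(CT_zip_df %>% select(key, name), # attach Facebook location keys
             by = "name")
```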
zip_ct_sf$name <- as.integer(zip_ct_sf$name)
income_sf <- st_as_sf(ct_income_df %>% 
  left_join(zip_ct_sf, by = c("ZipCode" = "name")) %>%
  select("ZipCode","geometry", "S1903_C03_001E")) 

income_sf$S1903_C03_001E <- as.integer(gsub("[^0-9.-]", "", income_sf$S1903_C03_001E))
#> Warning: NAs introduced by coercion

pal <- colorNumeric(palette = "viridis", domain = income_sf$S1903_C03_001E)

leaflet(data = income_sf) %>%
  addTiles() %>%
  addPolygons(
    fillColor = ~pal(as.integer(S1903_C03_001E)),
    weight = 2,
    color = "white",
    fillOpacity = 0.7,
    smoothFactor = 0.5,
    popup = ~paste(ZipCode, ": $", S1903_C03_001E, sep = "")
  )
which(income_sf$S1903_C03_001E < 20000)
#> [1] 214
income_sf[214,]
#> Simple feature collection with 1 feature and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -73.0573 ymin: 41.54603 xmax: -73.0205 ymax: 41.57236
#> Geodetic CRS:  WGS 84
#>     ZipCode S1903_C03_001E                       geometry
#> 214    6702          11663 MULTIPOLYGON (((-73.05539 4...

Total Facebook Users in CT

We query Facebook to obtain the estimated daily active users (DAU) for each zip code, then keep only the zip code and DAU columns.

total_people_ct_df <- query_fb_marketing_api(
    location_unit_type = "zip",
    location_keys      = map_param_vec(zip_ct_sf$key),
    # Two credential sets are supplied so queries can rotate when rate limited;
    # VERSION1, CREATION_ACT1, and TOKEN1 are defined like the first set
    version            = c(VERSION,      VERSION1),
    creation_act       = c(CREATION_ACT, CREATION_ACT1),
    token              = c(TOKEN,        TOKEN1))
| X | estimate_dau | estimate_mau_lower_bound | estimate_mau_upper_bound | location_unit_type | location_types | location_keys | gender | age_min | age_max | api_call_time_utc |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 70376 | 85100 | 100100 | zips | home or recent | US:06902 | 1 or 2 | 18 | 65 | 2024-04-29 09:11:38.661257 |
| 2 | 832 | 1000 | 1000 | zips | home or recent | US:06023 | 1 or 2 | 18 | 65 | 2024-04-29 09:11:39.381142 |
| 3 | 1152 | 1400 | 1600 | zips | home or recent | US:06068 | 1 or 2 | 18 | 65 | 2024-04-29 09:11:40.306598 |
| 4 | 3377 | 4200 | 5000 | zips | home or recent | US:06277 | 1 or 2 | 18 | 65 | 2024-04-29 09:11:41.243462 |
| 5 | 5609 | 7000 | 8200 | zips | home or recent | US:06786 | 1 or 2 | 18 | 65 | 2024-04-29 09:11:41.929814 |
| 6 | 2763 | 4100 | 4900 | zips | home or recent | US:06441 | 1 or 2 | 18 | 65 | 2024-04-29 09:11:42.942949 |
total_people_ct_df$ZipCode <- sub("US:", "", total_people_ct_df$location_keys)

total_people_ct_df <- total_people_ct_df %>%
  select("ZipCode","estimate_dau")
head(total_people_ct_df) %>% kable
| ZipCode | estimate_dau |
|---|---|
| 06902 | 70376 |
| 06023 | 832 |
| 06068 | 1152 |
| 06277 | 3377 |
| 06786 | 5609 |
| 06441 | 2763 |

Pulling iOS vs Android Data

These are the behavior IDs for people using iOS and Android devices, respectively:

behaviors_df <- get_fb_parameter_ids("behaviors", VERSION, TOKEN)
# iOS: 6004384041172; Android: 6004386044572
beh_mobile_ids <- c("6004384041172", "6004386044572")

fb_df <- query_fb_marketing_api(
    location_unit_type = "zip",
    location_keys      = map_param_vec(zip_ct_sf$key),
    behaviors          = map_param_vec(beh_mobile_ids),
    version            = c(VERSION,      VERSION1),
    creation_act       = c(CREATION_ACT, CREATION_ACT1),
    token              = c(TOKEN,        TOKEN1))
p_1a <- income_sf %>%
  ggplot() +
  geom_histogram(aes(x = S1903_C03_001E),
                 fill = "#4267B2",
                 color = "black",
                 bins = 30) +
  labs(x = "Income",
       y = "N ZipCodes",
       title = "Income Distribution of Zip Codes in CT") +
  theme_classic2()

# total_people_ct_complete: assumed to be total_people_ct_df with
# missing/incomplete zip codes removed (constructed outside this vignette)
p_1b <- total_people_ct_complete %>%
  ggplot() +
  geom_histogram(aes(x = estimate_dau),
                 fill = "#4267B2",
                 color = "black",
                 bins = 30) +
  # scale_x_log10() +
  labs(x = "Daily Active Users",
       y = "N ZipCodes",
       title = "Daily Active User Distribution of Zip Codes in CT") +
  theme_classic2()

ggarrange(p_1a, p_1b, nrow = 1, widths = c(0.45, 0.55))

Cleaning Data

To clean the data for analyzing the relationship between iOS/Android devices and income, we filter our datasets so that we only consider zip codes with over 2,000 total Facebook DAU and over 500 iOS/Android DAU.

ct_income_df$ZipCode <- as.integer(ct_income_df$ZipCode)
fb_df$ZipCode <- as.integer(sub("US:", "", fb_df$location_keys))
ios_data <- fb_df %>%
  filter(behaviors == 6004384041172) %>%
  inner_join(ct_income_df, by = "ZipCode") %>%
  inner_join(total_people_ct_complete, by = "ZipCode") %>%
  filter(estimate_dau.y > 2000) %>%
  filter(estimate_dau.x > 500) %>%
  mutate(percent_ios = estimate_dau.x/estimate_dau.y) %>%
  filter(percent_ios < 1) %>%
  select("ZipCode","percent_ios","S1903_C03_001E")
android_data <- fb_df %>%
  filter(behaviors == 6004386044572) %>%
  inner_join(ct_income_df, by = "ZipCode") %>%
  inner_join(total_people_ct_complete, by = "ZipCode") %>%
  filter(estimate_dau.y > 2000) %>%
  filter(estimate_dau.x > 500) %>%
  filter(estimate_dau.y > estimate_dau.x) %>%
  mutate(percent_android = estimate_dau.x/estimate_dau.y) %>%
  select("ZipCode","percent_android","S1903_C03_001E")

combined_data <- ios_data %>%
  inner_join(android_data, by= "ZipCode")
p_2a <- ios_data %>%
  ggplot() +
  geom_histogram(aes(x = percent_ios),
                 fill = "#4267B2",
                 color = "black",
                 bins = 30) +
  labs(x = "Percent iOS Users",
       y = "N ZipCodes",
       title = "iOS Distribution of Zip Codes") +
  scale_x_continuous(labels = scales::percent) +
  theme_classic2()

p_2b <- android_data %>%
  ggplot() +
  geom_histogram(aes(x = percent_android),
                 fill = "#4267B2",
                 color = "black",
                 bins = 30) +
  # scale_x_log10() +
  scale_x_continuous(labels = scales::percent) +
  labs(x = "Percent Android Users",
       y = "N ZipCodes",
       title = "Android Distribution of Zip Codes") +
  theme_classic2()

p_2c <- combined_data %>%
  ggplot() +
  geom_histogram(aes(x = 1 - percent_android - percent_ios),
                 fill = "#4267B2",
                 color = "black",
                 bins = 30) +
  # scale_x_log10() +
  scale_x_continuous(labels = scales::percent) +
  labs(x = "Percent Other Users",
       y = "N ZipCodes",
       title = "Other Distribution of Zip Codes") +
  theme_classic2()

ggarrange(p_2a, p_2b, p_2c, nrow = 1, widths = c(0.33, 0.33,0.33))

Visualization

p_theme <- theme(plot.title = element_text(face = "bold", size = 10),
                 plot.subtitle = element_text(face = "italic", size = 10),
                 axis.text = element_text(color = "black"),
                 axis.title = element_text(size = 10))
ios_plot <- ggplot(data = ios_data) +
  geom_smooth(method = "lm", se = TRUE, aes(x = as.numeric(percent_ios), y = as.numeric(S1903_C03_001E)), color = "blue") +
  xlim(0.2, 0.6) +
  ylim(0, 220000) +
  geom_jitter(mapping = aes(x = as.numeric(percent_ios), y = as.numeric(S1903_C03_001E)), width = 0,
              height = 0.1,
              pch = 21,
              size = 2,
              fill = "red") +
  labs(x = "% Users Access through iOS",
       y = "Income",
       title = "A. Income vs. % of Facebook users with iOS") + 
  theme_classic2() +
  p_theme
android_plot <- ggplot(data = android_data) +
  geom_smooth(method = "lm", se = TRUE, aes(x = as.numeric(percent_android), y = as.numeric(S1903_C03_001E)), color = "blue") +
  xlim(0.2,0.6) +
  ylim(0, 220000) +
  geom_jitter(mapping = aes(x = as.numeric(percent_android), y = as.numeric(S1903_C03_001E)), width = 0,
              height = 0.1,
              pch = 21,
              size = 2,
              fill = "red") +
  labs(x = "% Users Access through Android",
       y = "Income",
       title = "B. Income vs. % of Facebook users with Android") + 
  theme_classic2() +
  p_theme
ggarrange(ios_plot, android_plot, nrow = 1)


Part 2: Differences in Interests and Behaviors between Urban and Rural Zip Codes

This part of the vignette studies the urban-rural divide more generally across interests and behaviors, as defined by Facebook Marketing, between zip codes in Connecticut. Without access to private social media data, such studies would be very difficult and expensive to execute.

Defining Urban vs Rural

First, we must define whether a zip code is urban or rural. We take two approaches. The first is a binary classifier based on distance from an urban area in Connecticut (i.e., if a zip code is within a certain distance of a city, it is considered urban). The second considers the distance from the center of a zip code to the nearest city.

First we select the cities in Connecticut with a population greater than 100,000 (Waterbury, Stamford, Hartford, New Haven, Bridgeport). We consider any zip code within 5 kilometers of a city to be urban.

us_states_df <- get_fb_parameter_ids(type    = "region",
                                     version = VERSION, 
                                     token   = TOKEN,
                                     country_code = "US")
#> No encoding supplied: defaulting to UTF-8.

ct_key <- us_states_df %>% filter(name == "Connecticut") %>% pull(key) 

ct_cities_df <- get_fb_parameter_ids(type    = "city",
                                     version = VERSION, 
                                     token   = TOKEN,
                                     region_id = ct_key,
                                     q = "Connecticut")
#> No encoding supplied: defaulting to UTF-8.

city_keys <- ct_cities_df %>%
  filter(type == "city") %>%
  filter(name %in% c("Waterbury","New Haven", "Bridgeport","Stamford", "Hartford")) %>%
  pull(key)

ct_cities_coords <- get_location_coords(location_unit_type = "city",
                                 location_keys = city_keys,
                                 version = VERSION,
                                 token = TOKEN)
#> No encoding supplied: defaulting to UTF-8.

ct_cities_sf <- st_as_sf(ct_cities_coords, coords = c("longitude", "latitude"), crs = 4326, agr = "constant")

cities_buffered <- st_buffer(ct_cities_sf, dist = 5000)

leaflet() %>%
  addTiles() %>%
  addCircles(data = ct_cities_sf,
              popup = ~name,
             radius = 600, 
             color = "red") %>%
  addPolygons(data = cities_buffered,
              popup = ~name)

Based on our definition, we can now classify zip codes as urban or rural. We also save the distance between the center of each zip code and the closest city for our second definition.

zip_ct_sf_valid <- st_make_valid(zip_ct_sf)
sf_object <- st_simplify(zip_ct_sf_valid, preserveTopology = TRUE)
centroids <- st_centroid(sf_object)
#> Warning: st_centroid assumes attributes are constant over geometries
distances <- st_distance(centroids, ct_cities_sf)
min_distances <- apply(distances, 1, min)
zip_ct_sf <- cbind(zip_ct_sf, min_distances)

intersection <- st_intersection(cities_buffered, zip_ct_sf_valid)
#> Warning: attribute variables are assumed to be spatially constant throughout
#> all geometries

intersecting_zips <- unique(intersection$name.1)

zip_ct_sf_valid <- zip_ct_sf_valid %>%
  mutate(intersects_city = name %in% intersecting_zips)

leaflet() %>%
  addTiles() %>%
  addPolygons(data = zip_ct_sf_valid,
              popup = ~name,
              color = ~ifelse(intersects_city, "red", "blue"),
              fillColor = ~ifelse(intersects_city, "red", "blue"),
              fillOpacity = 0.5)  %>%
addCircles(data = ct_cities_sf,
              popup = ~name,
             radius = 600, 
             color = "black") %>%
  addPolygons(data = cities_buffered,
              popup = ~name,
              color = "black")

Querying Facebook Behaviors and Interests

We select certain interests and behaviors that were also studied in Rama et al. (2020) to compare differences. The interests and behaviors of note are shown below. We use rSocialWatcher to pull the data from the Facebook Marketing API and save it locally.

interests_df <- get_fb_parameter_ids("interests", VERSION, TOKEN)
#> No encoding supplied: defaulting to UTF-8.
behaviors_df <- get_fb_parameter_ids("behaviors", VERSION, TOKEN)
#> No encoding supplied: defaulting to UTF-8.

selected_behaviors <- c("Frequent Travelers",
                        "Frequent international travelers",
                        "Technology early adopters")

behaviors_id <- behaviors_df %>%
  filter(name %in% selected_behaviors) %>%
  pull(id)


selected_interests <- c("Gambling (gambling)",
                        "Organic food (food & drink)",
                        "Concerts (music event)",
                        "Fast food restaurants (dining)",
                        "Tattoos (body art)",
                        "Reading (communication)",
                        "Restaurants (dining)",
                        "Fitness and wellness (fitness)",
                        "Cooking (food & drink)",
                        "Vegetarianism (diets)",
                        "Jazz music (music)",
                        "Hunting (sport)",
                        "Fine art (visual art)",
                        "Politics (politics)")

interests_id <- interests_df %>%
  filter(name %in% selected_interests) %>%
  pull(id)
for (behavior in behaviors_id[2:length(behaviors_id)]) {
  fb_df <- query_fb_marketing_api(
      location_unit_type = "zip",
      location_keys      = map_param_vec(zip_ct_sf$key[1:length(zip_ct_sf$key)]),
      behaviors          = behavior,
      version            = c(VERSION, VERSION1),
      creation_act       = c(CREATION_ACT, CREATION_ACT1),
      token              = c(TOKEN, TOKEN1)
  )
  
  write.csv(fb_df, file = paste0("data/", behavior, ".csv"))
  
  print(paste("Done with", behavior))
}
for (interest in interests_id[3:4]) {
  fb_df <- query_fb_marketing_api(
      location_unit_type = "zip",
      location_keys      = map_param_vec(zip_ct_sf$key[1:length(zip_ct_sf$key)]),
      interests          = interest,
      version            = c(VERSION, VERSION1),
      creation_act       = c(CREATION_ACT, CREATION_ACT1),
      token              = c(TOKEN, TOKEN1)
  )
  
  write.csv(fb_df, file = paste0("data/", interest, ".csv"))
  
  print(paste("Done with", interest))
}

Analysis

Using the queried data, we perform our analysis. First, we run a regression of each behavior's share of users on the distance to the nearest city; the regression coefficient is used to denote the urban/rural difference. Second, we use the binary urban/rural classification of zip codes; here, we measure the difference in means and use a t-test to compute the 95% confidence interval.

Below are the results. Note that the two methodologies return relatively similar results.

lr_result_df <- data.frame(Coefficient = numeric(), Lower_CI = numeric(), Upper_CI = numeric(), behavior = character())

mapply(function(x, y) {
  data <- read.csv(paste0("data/", y,".csv"))
  
  data$ZipCode <- sub("US:", "", data$location_keys)
  zip_ct_sf$ZipCode <- as.integer(sub("US:", "", zip_ct_sf$key))
  data$ZipCode <- as.integer(data$ZipCode)
  ct_income_df$ZipCode <- as.integer(ct_income_df$ZipCode)

  merged_data <- data %>%
      inner_join(total_people_ct_complete, by = "ZipCode") %>%
      filter(estimate_dau.y > 2000) %>%
      filter(estimate_dau.x > 500) %>%
      mutate(percent_pop = estimate_dau.x/estimate_dau.y) %>%
      inner_join(zip_ct_sf, by = "ZipCode") %>%
      select("ZipCode","percent_pop","min_distances")
  lm_model <- lm(percent_pop ~ min_distances, data = merged_data)
  
  min_distance_coef <- coef(lm_model)["min_distances"]
  conf_intervals <- confint(lm_model)
  min_distance_conf_int <- conf_intervals["min_distances", ]

  lr_result_df <<-  rbind(lr_result_df, data.frame(
      Coefficient = min_distance_coef,
      Lower_CI = min_distance_conf_int[1],
      Upper_CI = min_distance_conf_int[2],
      behavior = x
  ))
  print(paste0("Done with ", x[1]))
}, c(selected_behaviors, selected_interests[4:14]), c(behaviors_id, interests_id[4:14]))
mean_result_df <- data.frame(Difference_in_Means = numeric(), Lower_CI = numeric(), Upper_CI = numeric(), behavior = character())

mapply(function(x, y) {
  data <- read.csv(paste0("data/", y,".csv"))
  
  data$ZipCode <- sub("US:", "", data$location_keys)
  zip_ct_sf_valid$ZipCode <- as.integer(sub("US:", "", zip_ct_sf_valid$key))
  data$ZipCode <- as.integer(data$ZipCode)
  ct_income_df$ZipCode <- as.integer(ct_income_df$ZipCode)

  merged_data <- data %>%
      inner_join(total_people_ct_complete, by = "ZipCode") %>%
      filter(estimate_dau.y > 2000) %>%
      filter(estimate_dau.x > 500) %>%
      mutate(percent_pop = estimate_dau.x/estimate_dau.y) %>%
      inner_join(zip_ct_sf_valid, by = "ZipCode") %>%
      select("ZipCode","percent_pop","intersects_city")
  
  # Grouping data by intersects_city
  group1 <- merged_data$percent_pop[merged_data$intersects_city == 0]
  group2 <- merged_data$percent_pop[merged_data$intersects_city == 1]
  
  # Performing Welch's t-test (kept for reference; the CI below is computed manually)
  t_test_result <- t.test(group1, group2)
  
  # Calculating the difference in means and a pooled-df confidence interval manually
  diff_means <- mean(group1) - mean(group2)
  se <- sqrt(var(group1)/length(group1) + var(group2)/length(group2))
  alpha <- 0.05
  t_critical <- qt(1 - alpha/2, df = length(group1) + length(group2) - 2)
  conf_interval <- c(diff_means - t_critical * se, diff_means + t_critical * se)
  
  mean_result_df <<-  rbind(mean_result_df, data.frame(
      Difference_in_Means = diff_means,
      Lower_CI = conf_interval[1],
      Upper_CI = conf_interval[2],
      behavior = x
  ))
  #print(paste0("Done with ", x))
  
}, c(selected_behaviors, selected_interests[4:14]), c(behaviors_id, interests_id[4:14]))
p_a1 <- ggplot(lr_result_df, aes(x = behavior, y = Coefficient, ymin = Lower_CI, ymax = Upper_CI)) +
              geom_pointrange() +  # Adds the points for coefficients and vertical lines for confidence intervals
              geom_hline(yintercept = 0, linetype = "dashed", color = "red") +  # Adds a dashed red line at zero
              coord_flip() +  # Makes the plot horizontal
              theme_minimal() +  # Applies a minimalistic theme
              labs(title = "Regression Coefficient with City Distance with 95% CI",
                   x = "Behavior", y = "Coefficient") + 
              theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
              annotate("text", x = min(lr_result_df$behavior), y = max(lr_result_df$Coefficient)*1.1, label = "Less Urban", vjust = 3) +
              annotate("text", x = min(lr_result_df$behavior), y = min(lr_result_df$Coefficient)*1.9, label = "More Urban", vjust = 3)

p_a2 <- ggplot(mean_result_df, aes(x = behavior, y = Difference_in_Means, ymin = Lower_CI, ymax = Upper_CI)) +
            geom_pointrange() +  # Adds the points for coefficients and vertical lines for confidence intervals
            geom_hline(yintercept = 0, linetype = "dashed", color = "red") +  # Adds a dashed red line at zero
            coord_flip() +  # Makes the plot horizontal
            theme_minimal() +  # Applies a minimalistic theme
            labs(title = "Difference in Mean with 95% CI",
                 x = "Behavior", y = "Difference in Means") + 
            theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
            theme(axis.text.y = element_blank()) +
            annotate("text", x = min(mean_result_df$behavior), y = -0.02, label = "More Urban", vjust = 3) +
            annotate("text", x = min(mean_result_df$behavior), y = 0.06, label = "Less Urban", vjust = 3)


ggarrange(p_a1, p_a2, nrow = 1, widths = c(0.6, 0.4))